Morphemes as Necessary Concept for Structures Discovery from Untagged Corpora

Author

Hervé Déjean

Abstract

This paper gives an overview of a method for discovering syntactic structures from untagged corpora. The method has three main steps: first, the discovery of the grammatical morphemes of the language; second, the construction of chunks, which are a multilingual conceptual level that bypasses the problematic notion of words; and finally, the discovery of the relations between chunks. We give an overview of the different procedures implemented, and we describe the discovery of morphemes in particular. This operation is divided into three steps: the discovery of the most frequent morphemes of the language, then the discovery of the other morphemes, and finally the segmentation of the words of the corpus. We conclude with the correction procedure, which requires the chunk level. The concepts and algorithms were tested on a twenty nat


Similar resources

Unsupervised Discovery of Persian Morphemes

This paper reports the current results of research on unsupervised Persian morpheme discovery. We present a method for discovering the morphemes of the Persian language through automatic analysis of corpora. We used a Minimum Description Length (MDL) based algorithm with some improvements and applied it to a Persian corpus. Our improvements include enhancing the cost function usin...


A Re-estimation Method for Stochastic Language Modeling from Ambiguous Observations

This paper describes a reestimation method for stochastic language models, such as the N-gram model and the Hidden Markov Model (HMM), from ambiguous observations. It is applied to model estimation for a tagger from an untagged corpus. We extend a previous algorithm that reestimates the N-gram model from an untagged segmented language (e.g., English) text as training data. The new meth...


Move Structures in “Statement-of-the-Problem” Sections of M.A. Theses: The Case of Native and Nonnative Speakers of English

Understanding how to structure the “Statement-of-the-Problem” (SP) section of a thesis is necessary for EFL students to develop a logical argument for a problem statement. This study set out to compare the Move structures of SP sections of theses written by native speakers of Persian (NSPs) and English (NSEs). To this end, 100 SP sections (50 SP sections written by NSE...


Generalized unknown morpheme guessing for hybrid POS tagging of Korean

Most errors in Korean morphological analysis and POS (Part-of-Speech) tagging are caused by unknown morphemes. This paper presents a generalized unknown morpheme handling method with POSTAG (POStech TAGger), which is a statistical/rule-based hybrid POS tagging system. The generalized unknown morpheme guessing is based on a combination of a morpheme pattern dictionary which encodes general le...


A 3-Steps Algorithm for Morphological Disambiguation Using Untagged Corpora

This article presents a three-step algorithm for morphological disambiguation between the definite article and the personal pronoun in French. Accuracy tested on large untagged corpora exceeds 98%, with less than 1% error. Our method has also been tested on unlabeled Greek corpora, and the results demonstrate the system’s portability to other languages with similar structure. Not a...



Publication date: 1998
